This report shows the QC steps for gene expression microarry data from GEO study, including:
Mannually change the variables for GEO ID (geo_id), data directory (datadir) and result directory (resdir).
Note that two variables, platform (platform), and normdata (whether the expression matrix is normalized), need to be manually re-defined in the following steps after look into the datasets. A shortname_func function is suggested to be updated.
Install the prerequisite R packages if they do not exist
Load the necessary libraries
Download the GEO series matrix files if available.
Show expression dataset features
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22215 features, 24 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM109204 GSM109205 ... GSM109227 (24 total)
## varLabels: title geo_accession ... Genotype:ch1 (37 total)
## varMetadata: labelDescription
## featureData
## featureNames: 1007_s_at 1053_at ... 91952_at (22215 total)
## fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16
## total)
## fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: GPL96
Manual inspection: Re-define the platform variable “platform” (i.e. “Affymetrix”, “Agilent”, “Illumina”).
Obtain raw phenotype information from the GEO dataset and generated a summary of all the phenotypic variables for overview.
For continuous variables, show the summary table. For categorical variables, only show the first five levels of variables.
Generate a variable, suppldata (whether supplementary data are available), based on whether the column supplementary_file is none.
| title | counts |
|---|---|
| Control(ethanol)-treated sample at 24hr, biological rep1 | 1 |
| Control(ethanol)-treated sample at 24hr, biological rep2 | 1 |
| Control(ethanol)-treated sample at 24hr, biological rep3 | 1 |
| Control(ethanol)-treated sample at 2hr, biological rep1 | 1 |
| Control(ethanol)-treated sample at 2hr, biological rep2 | 1 |
| geo_accession | counts |
|---|---|
| GSM109204 | 1 |
| GSM109205 | 1 |
| GSM109206 | 1 |
| GSM109207 | 1 |
| GSM109208 | 1 |
| status | counts |
|---|---|
| Public on Jun 02 2006 | 24 |
| submission_date | counts |
|---|---|
| May 17 2006 | 24 |
| last_update_date | counts |
|---|---|
| Apr 24 2013 | 24 |
| type | counts |
|---|---|
| RNA | 24 |
| channel_count | counts |
|---|---|
| 1 | 24 |
| source_name_ch1 | counts |
|---|---|
| MCF10A-Myc cells treated with dexamethasone for 24hr | 3 |
| MCF10A-Myc cells treated with dexamethasone for 2hr | 3 |
| MCF10A-Myc cells treated with dexamethasone for 30 min | 2 |
| MCF10A-Myc cells treated with dexamethasone for 30min | 1 |
| MCF10A-Myc cells treated with dexamethasone for 4hr | 3 |
| organism_ch1 | counts |
|---|---|
| Homo sapiens | 24 |
| characteristics_ch1 | counts |
|---|---|
| Genotype: unknown | 24 |
| characteristics_ch1.1 | counts |
|---|---|
| Age: 35 | 24 |
| treatment_protocol_ch1 | counts |
|---|---|
| MCF10A-Myc cells were subjected to 72 hrs of growth factor withdrawal followed by treatment with dexamethasone (Dex) for 24hr. | 1 |
| MCF10A-Myc cells were subjected to 72 hrs of growth factor withdrawal followed by treatment with dexamethasone for 2 hr. | 2 |
| MCF10A-Myc cells were subjected to 72 hrs of growth factor withdrawal followed by treatment with dexamethasone for 24 hr. | 2 |
| MCF10A-Myc cells were subjected to 72 hrs of growth factor withdrawal followed by treatment with dexamethasone for 2hr. | 1 |
| MCF10A-Myc cells were subjected to 72 hrs of growth factor withdrawal followed by treatment with dexamethasone for 30 min. | 3 |
| molecule_ch1 | counts |
|---|---|
| total RNA | 24 |
| extract_protocol_ch1 | counts |
|---|---|
| Total RNA from each sample was extracted using Qiagenââ¬â¢s RNeasy Kit. | 24 |
| label_ch1 | counts |
|---|---|
| biotin | 24 |
| label_protocol_ch1 | counts |
|---|---|
| Biotinylated cRNA were prepared according to the standard Affymetrix protocol from 6 microg total RNA (Expression Analysis Technical Manual, 2001, Affymetrix). | 24 |
| taxid_ch1 | counts |
|---|---|
| 9606 | 24 |
| hyb_protocol | counts |
|---|---|
| Following fragmentation, 10 microg of cRNA were hybridized for 16 hr at 45C on Affymetrix human genome HG-U133A genechip. GeneChips were washed and stained in the Affymetrix Fluidics Station 450. | 24 |
| scan_protocol | counts |
|---|---|
| GeneChips were scanned using the Affymetrix GeneChipî Scanner 3000 7G. | 19 |
| GeneChips were scanned using the Affymetrix GeneChipî Scanner 3000 7G.. | 5 |
| description | counts |
|---|---|
| Gene expression data from embryos in slow phase of cellularisation. | 1 |
| Gene expression data from MCF10A-Myc cells treated with dex for 2 hr. | 1 |
| Gene expression data from MCF10A-Myc cells treated with Dex for 2 hr. | 1 |
| Gene expression data from MCF10A-Myc cells treated with dex for 24 hr. | 1 |
| Gene expression data from MCF10A-Myc cells treated with Dex for 24 hr. | 1 |
| data_processing | counts |
|---|---|
| The data were analyzed and normalized Robust Multi-array Average (RMA) algorithm. | 24 |
| platform_id | counts |
|---|---|
| GPL96 | 24 |
| contact_name | counts |
|---|---|
| Suzanne,Daniela,Conzen | 24 |
| contact_email | counts |
|---|---|
| sconzen@medicine.bsd.uchicago.edu | 24 |
| contact_phone | counts |
|---|---|
| (773)834-2604 | 24 |
| contact_laboratory | counts |
|---|---|
| Conzen Lab | 24 |
| contact_department | counts |
|---|---|
| Medicine | 24 |
| contact_institute | counts |
|---|---|
| The University of Chicago | 24 |
| contact_address | counts |
|---|---|
| 5841 S. Maryland Ave, SBRI J301, MC 2115 | 24 |
| contact_city | counts |
|---|---|
| Chicago | 24 |
| contact_state | counts |
|---|---|
| IL | 24 |
| contact_zip/postal_code | counts |
|---|---|
| 60637 | 24 |
| contact_country | counts |
|---|---|
| USA | 24 |
| data_row_count | counts |
|---|---|
| 22215 | 24 |
| Age:ch1 | counts |
|---|---|
| 35 | 24 |
| Genotype:ch1 | counts |
|---|---|
| unknown | 24 |
## Warning in if (levels(pheno.raw$supplementary_file) == "NONE") {: the
## condition has length > 1 and only the first element will be used
These raw phenotypic variables are not informative (e.g. description, characteristics_ch1 and source_name_ch1) and not created in a consice way. Select useful phenotype variables and manually modify them using a standard format (e.g. GEO_ID, Subject, Disease, Treatment, Age, Gender). This step requires mannual inspection.
Manual inspection: Re-define the variable “suppldata” (i.e. TRUE or FALSE) showing whether supplementary data is available.
Download supplementary raw data files if available. Analysis using raw intensity data is only for data from Affymetrix platform. For other platforms, the expression matrix is derive from ExpressionSet object (gse object from GEOquery), but the batch (scan date) information is obtained from the supplementary files and used for differential expression analysis.
For data from Affymetrix platform, the raw.data object is generated from importing supplementary raw data files (usually .cel files) using R oligo package. Scan date information is derived from the raw.data for batch effect adjustment.
For data from Agilent platform, the raw.data object is derived from GEO expression matrix. Scan date information is derived from the raw data files (usually in the 3rd line of a .txt file) for batch effect adjustment.
If supplementary data is not available, the raw.data object is derived from GEO expression matrix. Scan date information is unknown.
Assign phenotype data to raw data object.
Show the summary of phenotype variables and the sample size for different groups
| GEO_ID | Subject | Sample | Tissue | treatment_time | treatment_drug | Treatment | Disease | Age | ScanDate_Group |
|---|---|---|---|---|---|---|---|---|---|
| GSM109204 | rep1 | GSM109204_rep1 | MCF10A-Myc | 30min | baseline | baseline_30min | healthy | 35 | 05/23/02 |
| GSM109205 | rep1 | GSM109205_rep1 | MCF10A-Myc | 30min | dex | dex_30min | healthy | 35 | 05/23/02 |
| GSM109206 | rep1 | GSM109206_rep1 | MCF10A-Myc | 2hr | baseline | baseline_2hr | healthy | 35 | 05/23/02 |
| GSM109207 | rep1 | GSM109207_rep1 | MCF10A-Myc | 2hr | dex | dex_2hr | healthy | 35 | 05/23/02 |
| GSM109208 | rep1 | GSM109208_rep1 | MCF10A-Myc | 4hr | baseline | baseline_4hr | healthy | 35 | 05/23/02 |
| Tissue | Disease | Treatment | Count |
|---|---|---|---|
| MCF10A-Myc | healthy | baseline_24hr | 3 |
| MCF10A-Myc | healthy | baseline_2hr | 3 |
| MCF10A-Myc | healthy | baseline_30min | 3 |
| MCF10A-Myc | healthy | baseline_4hr | 3 |
| MCF10A-Myc | healthy | dex_24hr | 3 |
| MCF10A-Myc | healthy | dex_2hr | 3 |
| MCF10A-Myc | healthy | dex_30min | 3 |
| MCF10A-Myc | healthy | dex_4hr | 3 |
| ScanDate_Group | Count |
|---|---|
| 04/13/05 | 8 |
| 05/23/02 | 7 |
| 05/31/02 | 8 |
| 06/03/02 | 1 |
Assign colors to scan date or disease/treatment if scan date is not available.
If gene expression matrix data is used, check if they are normalized/log-transformed. After manual inspection, assign a logistic variable “normdata” (whether needs log2 transformation/normalization or not for QC). If normdata is FASE, we generate boxplots for log2-transformed and Quantile-normalization of log2-transformed data. Note that if the data are normalized, it is not likely to detect the outliers based on the intensity metrices.
Manual inspection: Re-define the variable “normdata” (i.e. TRUE or FALSE) showing whether the expression data is normalized or not.
The major QC steps and scoring methods for outliers were adapted from arrayQualityMetrics. The threshold to determine an outlier used in arrayQualityMetrics is the boxplot’s upper whisker, i.e. values beyond 1.5 times the interquartile range, which is also applied to our pipeline. The following QC metrics are included in a routine analysis. The QC metrics used for outlier detection are marked with an asterisk.
All the above steps can be processed in data from Affymetrix gene expression array. For data from other platforms, metrics for “raw”" proble intensities, MA plots, heatmap for array distance and PCAs can be processed.
Use the prepared phenotype file.
The log2-transformed/normalized intensity distributions of all samples (arrays) are expected to have the similar scale (i.e. the similar positions and widths of the boxes). Outlier detection is applied by computing a Kolmogorov-Smirnov statistic (Ka) between log-intensity distribution for one array and the pooled array data, where an array with a Ka beyond the upper whisker is designated as an outlier.
Compute the Kolmogorov-Smirnov statistic Ka between each array’s (i.e. sample) values (i.e. log2 transformed raw probe intensity values) and the pooled, overall distribution of the values.
## 0 outlier(s) are detected in the raw intensity metrics.
The intensity curves of all samples (arrays) are expected to have the similar shapes and ranges. Samples with deviated curves are likely to have problematic experiments. For example, high levels of background will shift an array’s distribution to the right. Lack of signal diminishes its right right tail. A bulge at the upper end of the intensity range often indicates signal saturation.
Overall RNA quality can be assessed by RNA degradation plots. In the gene expression array, each probe is represented by a probe set. Each probe set is 11-20 probes (pairs of oligos). This plot shows the average intensity of each probe across all probe sets, ordered from the 5´ to the 3´ end. It is expected that probe intensities are lower at the 5´ end of a probe set when compared to the 3´end as RNA degradation starts from the 5´end of a molecule. RNA which is too degraded will shows a very high slope from 5´ to 3´. Thus, the standardized slope of the RNA degradation plot serves as quantitative indicator of the RNA degradation.
This step requires R package affy that outputs the probes in each probe set matrix ordered from 5’ to 3’, while this function is not implemented in the oligo package.
There are two paired probe types: perfect match (PM) and of mismatched (MM) probes. A PM probe matches a strand of cDNA, while the corresponding MM probe differs from the PM by a change in the central nucleotide. A probe set is called present if the intensity value of PM is significantly larger than MM. However, the Affymetrix approach is under attack because between 15%-30% of the MM are greater than the PM. For some newer arrays, MM probes are not used. If the number of PMs is not equal to that of MMs, this might be a PM-only array.
If both PM and MM are present, the density curves of log2 PM ana MM intensities are generated, where MM probes are expected to have smaller log2-intensity at the peak than PM probes due to their nonspecific hybridization.
MA plots allow pairwise comparison of the log-intensity of each array to a reference array and identification of intensity-dependent biases.
The y-axis of the MA-plot shows the log-ratio intensity of one array to the reference median array, which is called M (minus). M = log2(I1)-log2(I2) (I1: the intensity of the array studied; I2: the median intensity across arrays)
The x-axis indicates the average log-intensity of both arrays, which is called A (add). A = 1/2*(log2(I1)+log2(I2))
It is expected that the probe levels do not differ systematically within a group of replicates, so that the MA-plot is centered on the y-axis (y=0 or M=0) from low to high intensities.
If any intensity value is zero in raw data (not normalized), add constant one to the intensity matrix, as Hoeffding’s statistics cannot handle infinite value.
Outlier detection is applied by computing a Hoeffding’s statistic (Da) on the joint distribution of A and M for each array, where an array with a Da >0.15 is designated as an outlier.
## 2 outliers are detected in the MA metrics.
## They are: GSM109212.cel.gz GSM109217.cel.gz
MA plots of the samples with the 4 highest and lowest Hoeffding’s statistics.
Spatial plots show an artificial colored image of an array’s spatial distribution of intensities that indicate spatial variation in an array. Log-intensities of probes are plotted by their corresponding spatial x and y-coordinate in the array and are expected to be uniformly distributed if the array data has good quality. The rank scale is applied for plotting as it has the potential to amplify patterns that are small in amplitude but systematic within an array.
The affy package is required to obtain the AffyBatch object that contains information of spatial x- and y-coordinate, while this function is not implemented in the oligo package.
Outlier detection is applied by computing a sum of the absolute values of low frequency Fourier coefficients (Fa) across all probe sets for each array, where an array with a Fa beyond the upper whisker is designated as an outlier.
## 1 outlier(s) are detected in the spatial metrics.
## They are: GSM109215.cel.gz
Spatial distribution plots of samples with the 4 highest and lowest values of Fa. The value of Fa for each sample is shown in the panel headings. Outliers marked with * have Fa values of large scale spatial structures.
The normalized unscaled standard error (NUSE) and relative log expression (RLE) boxplots indicate probe set homogeneity in one array, where the metrics are derived from a fitted probe level model by the fitProbeLevelModel function (oligo). The RLE plots represent the distribution of the ratio between the log-intensity of a probe set and the median log-intensity of the corresponding probe set across all arrays, expected to be centered near 0, as a log scale is applied. Outlier detection is applied by computing a Kolmogorov-Smirnov statistic (Ra) between RLE distribution for one array and the pooled array data, where an array with a Ra beyond the upper whisker is designated as an outlier
Compute the Kolmogorov-Smirnov statistic Ra between each array’s (i.e. sample) values (i.e. relative log expression values) and the pooled, overall distribution of the values. Detect outliers that are deviated from the threshold.
## 2 outlier(s) are detected in RLE metrics.
## They are: GSM109210.cel.gz GSM109215.cel.gz
Use boxplot_func function to plot RLE. Outliers marked with * have values centered away from 0 and/or are more spread out are potentially problematic.
The NUSE plots show the distribution of normalized standard error estimates, expected to be centered near 1. Outlier detection is applied by computing an upper hinge (Na) across all probe sets for each array, where an array with a Na beyond the upper whisker is designated as an outlier.
Compute 75% quantile Na of each array’s NUSE values Detect outliers that have larger Na deviated from the threshold.
2 outlier(s) are detected in NUSE metrics. They are: GSM109210.cel.gz GSM109215.cel.gz
Use boxplot_func function to plot RLE. Outliers marked with * have values centered away from 0 and/or are more spread out are potentially problematic.
Distance between arrays is evaluated using mean absolute difference of log-intensity/normalized intensity between each pair of arrays, where the hierarchical tree between arrays is created based on the distance, which is visualized by a heatmap and dendrogram.
The distance d(ab) between two arrays a and b is computed as the mean absolute difference (L1-distance) between the data of the arrays (using the data from all probes without filtering). In the formula (the dist2 function from genefilter package), d(ab) = mean | M(ai) - M(bi) |, where M(ai) is the value of the i-th probe on the a-th array.
If any intensity value is zero in raw data (not normalized), add constant one to the intensity matrix, as hierarchical clustering cannot handle infinite/NA value.
Outlier detection is applied by computing the sum of the distances of one array to all other arrays (Sa) (Sa=Sum(b)d(ab)), where an array with a Sa beyond the upper whisker is designated as an outlier.
## 0 outlier(s) are detected in sample distance metrics.
PCA demonstrates information of the expression dataset in a reduced number of dimensions. Clustering and PCA plots enable to assess to what extent arrays resemble each other, and whether this corresponds to the known resemblances of the samples.
| PC | Proportion of Variance (%) | Cumulative Proportion of Variance (%) |
|---|---|---|
| PC1 | 85.75 | 85.75 |
| PC2 | 5.7 | 91.45 |
| PC3 | 2.186 | 93.64 |
| PC4 | 1.211 | 94.85 |
| PC5 | 0.979 | 95.83 |
| PC6 | 0.64 | 96.47 |
| PC7 | 0.4505 | 96.92 |
| PC8 | 0.4126 | 97.33 |
| PC9 | 0.3629 | 97.7 |
| PC10 | 0.3118 | 98.01 |
PCA plots are generated using the first two principle components colored by known factors (e.g. treatment/disease conditions, donors and scan dates), visualizing similarities between arrays and these similarities’ correlation to batch effects.
The summary of outliers and detection methods
| Frequency | Method | |
|---|---|---|
| GSM109212.cel.gz | 1 | MA |
| GSM109217.cel.gz | 1 | MA |
| GSM109210.cel.gz | 2 | NUSE, RLE |
| GSM109215.cel.gz | 3 | NUSE, RLE, spatial |
Create a new column “QC_Pass” in the phenotype file. By default, samples detected as an outlier more than twice are assigned to 0 otherwise to 1.
## 1 outlier(s) are defined.
## They are: GSM109215
| Tissue | Treatment | Disease | Counts |
|---|---|---|---|
| MCF10A-Myc | baseline_24hr | healthy | 3 |
| MCF10A-Myc | baseline_2hr | healthy | 3 |
| MCF10A-Myc | baseline_30min | healthy | 3 |
| MCF10A-Myc | baseline_4hr | healthy | 3 |
| MCF10A-Myc | dex_24hr | healthy | 3 |
| MCF10A-Myc | dex_2hr | healthy | 3 |
| MCF10A-Myc | dex_30min | healthy | 3 |
| MCF10A-Myc | dex_4hr | healthy | 3 |
| Tissue | Treatment | Disease | Counts |
|---|---|---|---|
| MCF10A-Myc | baseline_24hr | healthy | 3 |
| MCF10A-Myc | baseline_2hr | healthy | 3 |
| MCF10A-Myc | baseline_30min | healthy | 3 |
| MCF10A-Myc | baseline_4hr | healthy | 3 |
| MCF10A-Myc | dex_24hr | healthy | 3 |
| MCF10A-Myc | dex_2hr | healthy | 2 |
| MCF10A-Myc | dex_30min | healthy | 3 |
| MCF10A-Myc | dex_4hr | healthy | 3 |
R version 3.4.4 (2018-03-15)
**Platform:** x86_64-w64-mingw32/x64 (64-bit)
locale: LC_COLLATE=English_United States.1252, LC_CTYPE=English_United States.1252, LC_MONETARY=English_United States.1252, LC_NUMERIC=C and LC_TIME=English_United States.1252
attached base packages: stats4, parallel, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: hgu133acdf(v.2.18.0), pd.hg.u133a(v.3.12.0), DBI(v.0.8), RSQLite(v.2.0), dplyr(v.0.7.4), pander(v.0.6.1), preprocessCore(v.1.40.0), devtools(v.1.13.5), Hmisc(v.4.1-1), Formula(v.1.2-2), survival(v.2.41-3), lattice(v.0.20-35), gplots(v.3.0.1), ggplot2(v.2.2.1), viridis(v.0.5.1), viridisLite(v.0.3.0), oligo(v.1.42.0), Biostrings(v.2.46.0), XVector(v.0.18.0), IRanges(v.2.12.0), S4Vectors(v.0.16.0), oligoClasses(v.1.40.0), GEOquery(v.2.46.15), Biobase(v.2.38.0) and BiocGenerics(v.0.24.0)
loaded via a namespace (and not attached): bitops(v.1.0-6), matrixStats(v.0.53.1), bit64(v.0.9-7), RColorBrewer(v.1.1-2), rprojroot(v.1.3-2), GenomeInfoDb(v.1.14.0), tools(v.3.4.4), backports(v.1.1.2), R6(v.2.2.2), affyio(v.1.48.0), rpart(v.4.1-13), KernSmooth(v.2.23-15), lazyeval(v.0.2.1), colorspace(v.1.3-2), nnet(v.7.3-12), withr(v.2.1.2), tidyselect(v.0.2.4), gridExtra(v.2.3), bit(v.1.1-12), compiler(v.3.4.4), htmlTable(v.1.11.2), xml2(v.1.2.0), DelayedArray(v.0.4.1), labeling(v.0.3), checkmate(v.1.8.5), caTools(v.1.17.1), scales(v.0.5.0), readr(v.1.1.1), stringr(v.1.3.0), digest(v.0.6.15), foreign(v.0.8-69), rmarkdown(v.1.9), base64enc(v.0.1-3), pkgconfig(v.2.0.1), htmltools(v.0.3.6), limma(v.3.34.9), htmlwidgets(v.1.0), rlang(v.0.2.0), rstudioapi(v.0.7), BiocInstaller(v.1.28.0), bindr(v.0.1.1), gtools(v.3.5.0), acepack(v.1.4.1), RCurl(v.1.95-4.10), magrittr(v.1.5), GenomeInfoDbData(v.1.0.0), Matrix(v.1.2-12), Rcpp(v.0.12.16), munsell(v.0.4.3), stringi(v.1.1.7), yaml(v.2.1.18), SummarizedExperiment(v.1.8.1), zlibbioc(v.1.24.0), plyr(v.1.8.4), grid(v.3.4.4), affxparser(v.1.50.0), blob(v.1.1.1), gdata(v.2.18.0), splines(v.3.4.4), hms(v.0.4.2), knitr(v.1.20), pillar(v.1.2.1), GenomicRanges(v.1.30.3), codetools(v.0.2-15), glue(v.1.2.0), evaluate(v.0.10.1), latticeExtra(v.0.6-28), data.table(v.1.10.4-3), foreach(v.1.4.4), gtable(v.0.2.0), purrr(v.0.2.4), tidyr(v.0.8.0), assertthat(v.0.2.0), ff(v.2.2-13), tibble(v.1.4.2), iterators(v.1.0.9), AnnotationDbi(v.1.40.0), memoise(v.1.1.0), cluster(v.2.0.6) and bindrcpp(v.0.2.2)